
Conversation

@Andrurachi (Contributor) commented Dec 5, 2025

Summary

Introduces a parallelized path for HashedPostStateSorted::from_reverts using rayon, gated behind a new parallel-from-reverts feature flag.

This optimization targets CPU bottlenecks caused by deep reorgs or blocks with skewed storage distributions. An account-count threshold (2,500) ensures no regressions on small or standard blocks.

Motivation

Closes #20049

Processing large state reverts involves two distinct phases:

  1. DB Cursor Walk (I/O Bound): Sequential reading of account/storage changesets.
  2. Sorting & Hashing (CPU Bound): Organizing the raw data into a sorted state.

While the DB walk must remain sequential, the sorting phase becomes a bottleneck when thousands of accounts (or accounts with a very large number of storage slots) need processing. This PR parallelizes the CPU-bound sorting phase, reducing wall-time for heavy blocks.

Implementation Details

  1. Feature Flag: Added parallel-from-reverts (opt-in).
  2. Strategy: Parallelize the outer loop (iterating over accounts) rather than the inner loop (sorting slots). Benchmarks showed that inner-loop parallelization (par_sort_unstable) introduces too much overhead for typical slot counts (< 1,000).
  3. Threshold: A threshold of 2,500 accounts was established via micro-benchmarking; the dispatch is sketched after this list.
    • < 2,500: always sequential (avoids Rayon overhead).
    • >= 2,500: parallel (distributes load, handles skewed accounts).
  4. Logic Preservation: Preserves the existing eager hashing logic (hashing keys during collection) to ensure cache locality and consistency with the sequential path.
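
To make the shape of the change concrete, here is a minimal sketch of the threshold-gated dispatch. The flat `(Address, slots)` input, the `sort_account` helper, and the unconditional rayon import are illustrative assumptions; the actual types and gating in `HashedPostStateSorted::from_reverts` differ (in the PR, rayon is an optional dependency behind the feature flag).

```rust
use alloy_primitives::{keccak256, Address, B256, U256};
use rayon::prelude::*;

// Threshold from the PR description: below it, rayon overhead outweighs the gain.
const PARALLEL_THRESHOLD: usize = 2_500;

/// Per-account unit of work: eagerly hash the slot keys, then sort by hashed key.
/// (Hypothetical helper for illustration only.)
fn sort_account(address: Address, slots: Vec<(B256, U256)>) -> (B256, Vec<(B256, U256)>) {
    let mut hashed: Vec<_> = slots
        .into_iter()
        .map(|(key, value)| (keccak256(key), value))
        .collect();
    // Plain per-account sort: inner-loop parallelism (par_sort_unstable) was measured
    // to cost more than it saves for typical slot counts (< 1,000).
    hashed.sort_unstable_by_key(|(key, _)| *key);
    (keccak256(address), hashed)
}

/// Outer-loop dispatch: parallelize across accounts only above the threshold.
fn sort_storages(
    storages: Vec<(Address, Vec<(B256, U256)>)>,
) -> Vec<(B256, Vec<(B256, U256)>)> {
    let use_parallel =
        cfg!(feature = "parallel-from-reverts") && storages.len() >= PARALLEL_THRESHOLD;

    if use_parallel {
        // One rayon task per account, so "whale" accounts are processed
        // concurrently with the rest instead of serializing the whole pass.
        storages
            .into_par_iter()
            .map(|(address, slots)| sort_account(address, slots))
            .collect()
    } else {
        // Small inputs stay sequential to avoid rayon setup overhead.
        storages
            .into_iter()
            .map(|(address, slots)| sort_account(address, slots))
            .collect()
    }
}
```

Keeping the parallelism at the account level keeps each rayon task coarse-grained, which is why the skewed scenario below benefits the most.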

Benchmarks

1. Micro-Benchmarks (sorting_par_exp)

Measured purely the sorting/hashing overhead (excluding DB reads).

| Scenario | Accounts | Slots/Acct | Seq Time | Par Time | Speedup |
|---|---|---|---|---|---|
| Acc_Low | 100 | 4 | 8 µs | 124 µs | -15x |
| Acc_Med | 1,000 | 4 | 73 µs | 122 µs | -0.6x |
| Acc_High | 10,000 | 4 | 790 µs | 612 µs | 1.3x |
| Skewed_10k | 10,000 | Mixed* | 64 ms | 14 ms | 4.6x |

*Skewed distribution: 95% of accounts have 4 slots, 5% have 2,000 slots.
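
For reference, the skewed input can be generated with a simple deterministic rule; this is a sketch with placeholder integer keys and values, not the benchmark's actual setup:

```rust
/// Sketch: 1 account in 20 (5%) is a "whale" with 2,000 slots; the rest get 4.
fn skewed_accounts(total: u64) -> Vec<(u64, Vec<(u64, u64)>)> {
    (0..total)
        .map(|account| {
            let slot_count = if account % 20 == 0 { 2_000 } else { 4 };
            let storage = (0..slot_count).map(|slot| (slot, slot + 1)).collect();
            (account, storage)
        })
        .collect()
}
```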

2. Integration Benchmarks (integration_bench)

Measured full lifecycle: DB Read -> Hashing -> Sorting -> Allocation.

| Scenario | Description | Seq Time | Par Time | Delta |
|---|---|---|---|---|
| Small | 1k Acc (below threshold) | 61.2 ms | 61.0 ms | 0% (no regression) |
| Large_Uniform | 10k Acc (20 slots) | 270 ms | 267 ms | ~1.2% |
| Large_Skewed | 10k Acc (whale dist.) | 1,439 ms | 1,363 ms | ~5.3% |

Since total runtime is dominated by DB I/O, these deltas represent a solid optimization of the available CPU-bound work. The threshold ensures no regression for small blocks.

Checklist

  • Added rayon dependency (optional).
  • Implemented parallel-from-reverts feature flag.
  • Added benchmarks covering Uniform, Skewed, and Threshold scenarios.
  • Verified no regressions for inputs below the 2,500 account threshold.

A collaborator left a review comment on the following hunk:

```rust
let use_parallel =
    cfg!(feature = "parallel-from-reverts") && storages.len() >= PARALLEL_THRESHOLD;

let hashed_storages = if use_parallel {
    #[cfg(feature = "parallel-from-reverts")]
```
In other places we feature-flag rayon usage behind std, imo we could do that here and not introduce a new feature flag.

```rust
#[cfg(not(feature = "std"))]
// If nostd then serially reveal storage proof nodes for each storage trie
{
    for (account, storage_subtree) in storages {
        self.reveal_decoded_storage_multiproof(account, storage_subtree)?;
    }
    Ok(())
}
#[cfg(feature = "std")]
// If std then reveal storage proofs in parallel
{
    // ... (parallel path elided in the quoted snippet)
```
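
Applied to this PR, that suggestion would replace the dedicated flag with the existing std gate, roughly as follows. This is a sketch only, reusing the hypothetical sort_account helper and PARALLEL_THRESHOLD from the earlier example, and it assumes rayon is pulled in by the crate's std feature:

```rust
// Sketch (assumption): gate the parallel path behind `std` instead of a new
// `parallel-from-reverts` feature, mirroring the storage-proof example above.
#[cfg(not(feature = "std"))]
let hashed_storages: Vec<_> = storages
    .into_iter()
    .map(|(address, slots)| sort_account(address, slots))
    .collect();

#[cfg(feature = "std")]
let hashed_storages: Vec<_> = if storages.len() >= PARALLEL_THRESHOLD {
    use rayon::prelude::*;
    storages
        .into_par_iter()
        .map(|(address, slots)| sort_account(address, slots))
        .collect()
} else {
    storages
        .into_iter()
        .map(|(address, slots)| sort_account(address, slots))
        .collect()
};
```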

@mediocregopher (Collaborator) commented:

@yongkangc let's rbc this as well to confirm there's no regression 🙏

@yongkangc (Member) commented Dec 10, 2025

hey @Andrurachi, thanks for the PR, could you rebase / merge main in so that we can bench this as well?

the reason is that we recently introduced a breaking change to MerkleChangeSets, which requires merging main before benching

Parallelizes the hashing/sorting step using Rayon when account count
exceeds a threshold (2500). This alleviates CPU bottlenecks during large
state reverts or deep reorgs.

Closes paradigmxyz#20049
@Andrurachi force-pushed the perf-parallel-reverts-sorting-20049 branch from b2011f6 to 1843db5 on December 10, 2025 at 04:26.
@Andrurachi (Contributor, Author) commented:

@yongkangc I've rebased on main to bring in the breaking changes to MerkleChangeSets. The PR compiles and the benchmarks pass. I'm running them with `cargo bench -p reth-trie-db --bench integration_bench` and `cargo bench -p reth-trie-db --bench integration_bench --features parallel-from-reverts`.

Regarding moving the feature flag to std: I tried replacing parallel-from-reverts with the standard default = ["std"] pattern, but I hit a wall of Zepter errors regarding feature propagation.

To avoid pushing a broken config or making you wait longer than needed, I've kept parallel-from-reverts for this push. If required, could you please provide some advice on the correct Cargo.toml setup to satisfy Zepter for this specific crate?
